
I Deleted Sonnet Because My Dataloader Had Too Many Hands

I deleted Sonnet today. Not because it was bad. Not because it failed. Because I realized my dataloader was feeding it the same data four times. Because I had four dataloader workers. Because four workers were enough to feed my GPU. Because I did not think about what four workers meant for data repetition.

The model overfit in the most dramatic way possible. The loss curve looked like a heartbeat during a panic attack. High loss. Lower loss. Lower loss. Lowest loss. Insanely high loss. Repeat. I watched it for hours thinking something interesting was happening. Something interesting was happening. I was watching my model memorize the same four examples on a loop.

Overfitting usually happens slowly. Mine happened with the subtlety of a fire alarm in a library.

The Loss Curve Of Doom

Let me show you what I saw. This is not a joke. This is my actual training log from before I realized my mistake.

Step 1000: Loss: 4.21
Step 2000: Loss: 3.18
Step 3000: Loss: 2.45
Step 4000: Loss: 1.89
Step 5000: Loss: 12.74
Step 6000: Loss: 3.02
Step 7000: Loss: 2.11
Step 8000: Loss: 1.67
Step 9000: Loss: 15.93
# The pattern repeats every 4000 steps
# Because the dataloader cycled through 4 copies of the same data

I thought the spikes were gradient explosions. I added clipping. I thought the drops were breakthroughs. I celebrated. I was celebrating my model memorizing the same four prompts. The irony is thick enough to cut with a tensor.

How It Happened

The dataloader configuration seemed fine. Four worker processes. Shuffle enabled. Batch size set. Everything looked correct. What I did not realize: each worker was loading the same dataset slice. No shuffling across workers. No coordination. Just four copies of the same data feeding the GPU in rotation.

# My dataloader config (the problematic part)
DataLoader(
    dataset=my_dataset,
    batch_size=32,
    num_workers=4,   # Four workers
    shuffle=True,    # Shuffle within each worker
    # But no shuffle ACROSS workers
    # So each worker sees the same data
    # And the GPU sees each example 4x per epoch
)
# I learned about distributed sampling the hard way
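The fix, once I understood the problem, was embarrassingly small. Here is a rough sketch of the shape of it, not my exact code: a toy samples list, a ShardedStream class I am inventing for illustration, and PyTorch's get_worker_info doing the real work of handing each worker its own slice of the data.

from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedStream(IterableDataset):
    # Streams samples, giving each DataLoader worker its own disjoint slice
    def __init__(self, samples):
        self.samples = samples

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield everything
            yield from self.samples
        else:
            # Each worker starts at its own id and strides by num_workers,
            # so the GPU never sees the same example twice in one pass
            yield from self.samples[info.id::info.num_workers]

samples = [f"prompt {i}" for i in range(1000)]  # stand-in for the real dataset
loader = DataLoader(ShardedStream(samples), batch_size=32, num_workers=4)

For map-style datasets spread across multiple processes, a DistributedSampler does the equivalent job. Either way, the point is the same: somebody has to hand out disjoint slices, and it is not going to happen by accident.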

The GPU was happy. It was getting fed consistently. The loss was going down. Sometimes. Then spiking. Then going down again. I thought this was normal training dynamics. It was not. It was my model seeing the same example four times and getting very confused about which version was real.

The Realization

I only noticed when I printed a batch. The same prompt appeared four times in a row. Then the next four were also the same. Then the next. I did the math. Four workers. Same dataset. No distributed sampling. My model was training on 25% of the data, repeated four times.
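If you want to catch this earlier than I did, the check is one loop. A sketch, assuming a loader that yields batches of string prompts; adjust for whatever your batches actually hold:

batch = next(iter(loader))
for prompt in batch[:8]:
    print(prompt)
# Healthy output: eight different prompts.
# My output: the same prompt four times, then another prompt four times.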

I felt many things. Confusion. Embarrassment. The urge to delete everything. I chose deletion. Sometimes the cleanest solution is to start over. Sometimes the only way forward is to admit you fed your model the same lunch four days in a row and wonder why it got bored.

Debugging is just apologizing to your code until it works again.

The Rental Consideration

Here is where things get real. Sonnet 1 has not been released. Opus 1 has not been released. They are both still training. Or were training. Until I deleted one of them because of a dataloader bug.

I am thinking about renting a GPU. Just for a little while. Just to finish these models. Just to see if cloud infrastructure has better dataloader documentation. Just to stop watching my electricity meter spin like a ceiling fan.

Sonnet 1 releases so far: 0
Opus 1 releases so far: 0

Local training is noble. Local training is educational. Local training is also slow and prone to dataloader disasters. Renting a GPU feels like cheating. But cheating that gets models released. Cheating that lets me sleep at night. Cheating that might actually work.

The Silver Lining

Here is the good news. When Sonnet 1 and Opus 1 do release, they will have Engrams AND DeepSeek hyper connections. Both features are implemented. Both are tested. Both survived my dataloader disaster because they live in the model architecture, not the data pipeline.

Engrams provide external memory for static knowledge. DeepSeek hyper connections allow better information flow between layers. Together they make small models smarter. Together they survived my incompetence. Together they will power the first generation of TinyMemoryLM models that actually speak English.
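If you want a picture of what the hyper connections part actually does, here is a very loose sketch of the idea as I understand it: keep a few parallel copies of the residual stream and let the model learn how much each copy feeds a layer and how much of the layer's output each copy absorbs. This is illustrative PyTorch with made-up names like HyperConnection and n_streams, not my actual TinyMemoryLM code and not anyone's official implementation.

import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    # Keeps n parallel copies of the residual stream. Learnable weights decide
    # how much each copy feeds the layer, how much of the layer output each
    # copy takes back, and how the copies remix with each other.
    def __init__(self, n_streams):
        super().__init__()
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # streams -> layer input
        self.write = nn.Parameter(torch.ones(n_streams))                     # layer output -> streams
        self.mix = nn.Parameter(torch.eye(n_streams))                        # streams -> streams

    def forward(self, streams, layer):
        # streams: (n_streams, batch, seq_len, d_model)
        layer_in = torch.einsum("n,nbsd->bsd", self.read, streams)
        layer_out = layer(layer_in)
        streams = torch.einsum("nm,mbsd->nbsd", self.mix, streams)
        return streams + self.write.view(-1, 1, 1, 1) * layer_out

hc = HyperConnection(n_streams=2)
streams = torch.randn(2, 4, 16, 64)   # 2 copies of a (batch=4, seq=16, d_model=64) stream
out = hc(streams, nn.Linear(64, 64))  # any per-token layer slots in here
# out.shape -> torch.Size([2, 4, 16, 64])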

The architecture is ready. The code is solid. The dataloader is fixed. Sonnet-1 will train properly this time. Opus-1 will follow. They will have memory. They will have connections. They will have a dataloader that does not feed them the same sandwich four times.

What I Learned

First, always verify your data pipeline. Print batches. Check for duplicates. Assume your dataloader is lying to you until proven otherwise.
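Concretely, the check I now run before any long job looks something like this. It is a sketch, assuming the loader yields batches of hashable items like strings; it just counts how often each example shows up in one pass.

from collections import Counter

seen = Counter()
for batch in loader:      # one full pass over the dataloader
    seen.update(batch)

repeats = sum(1 for count in seen.values() if count > 1)
print(f"{len(seen)} unique examples, {repeats} of them repeated")
# Healthy epoch: zero repeats. My broken epoch: everything showed up four times.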

Second, overfitting can look like progress. A loss going down does not always mean learning. Sometimes it means memorizing. Sometimes it means your model is very good at recognizing the same four examples.

Third, deleting a training run hurts. But keeping a broken one hurts more. Sometimes the brave choice is to start over. Sometimes the smart choice is to admit you messed up and fix it.

Fourth, maybe renting a GPU is not cheating. Maybe it is just being practical. Maybe local training is a hobby and cloud training is how work gets done. Maybe both can be true.

What Comes Next

I am researching GPU rental options. RunPod. Lambda Labs. Vast.ai. The prices make me wince. The speed makes me hopeful. The idea of finishing Sonnet-1 in days instead of weeks makes me consider selling a kidney.

Sonnet-1 training will restart soon. With the fixed dataloader. With Engrams. With hyper connections. With hope. With a lot of monitoring. I will watch the loss curve like a hawk. I will print batches like a paranoid parent. I will not celebrate until I see genuine learning.

Opus-1 will follow. Six hundred million parameters. Forty days on my GPU. Maybe ten days on a rented cluster. Maybe less. Maybe I will finally release something that does not output pipe characters.

Final Thoughts

I deleted Sonnet today. Because my dataloader had too many hands. Because I did not think about distributed sampling. Because overfitting can look like progress. Because sometimes the right choice is to start over.

The next Sonnet will be better. It will have Engrams. It will have hyper connections. It will have a dataloader that does not feed it the same data four times. It will learn. It will speak. It will not output pipe characters. Probably.

And maybe, just maybe, I will rent a GPU. Not because I gave up on local training. Because I want to finish what I started. Because Sonnet-1 and Opus-1 deserve to exist. Because sometimes the hobbyist becomes the professional by renting better tools.

Progress is not linear. Progress is two steps forward, one step into a dataloader bug, three steps back to fix it, then finally moving forward. I am still moving forward. Slowly. Painfully. But forward.